Abstract

Graph clustering is the problem of identifying sparsely connected dense
subgraphs (clusters) in a given graph. Proposed clustering algorithms usually
optimize various fitness functions that measure the quality of a cluster
within the graph. Examples of such cluster measures include the conductance,
the local and relative densities, and single cluster editing. We prove
that the decision problems associated with the optimization tasks of finding
the clusters that are optimal with respect to these fitness measures are
NP-complete.



1 Introduction 

Clustering is an important issue in the analysis and exploration of data. There 
is a wide area of applications in data mining, VLSI design, parallel computing, 
web searching, software engineering, computer graphics, gene analysis, etc. See 
also [2] for an overview. Intuitively, clustering consists in discovering natural
groups (clusters) of similar elements in a data set. An important variant of data
clustering is graph clustering where the similarity relation is expressed by a graph.
In this paper, we restrict ourselves to unweighted, undirected graphs with no self-loops.

We first recall some basic definitions from graph theory. Let $G = (V, E)$ be an
undirected graph and denote by $E(S) = \{\{u,v\} \in E ; u, v \in S\}$ the set of
edges in a subgraph $G(S) = (S, E(S))$ induced by a subset of vertices
$S \subseteq V$. We say that $S \subseteq V$ creates a clique of size $|S|$ if edges
in $E(S) = \{\{u,v\} ; u, v \in S, u \neq v\}$ join every two different vertices in
$S$. Further denote by $d_G(v) = |\{u \in V ; \{u,v\} \in E\}|$ the degree of vertex
$v \in V$ in $G$. We say that graph $G$ is a cubic graph if $d_G(v) = 3$ for every
$v \in V$. Moreover, any subset of vertices $A \subseteq V$ creates a cut of $G$,
that is a partition of $V$ into disjoint sets $A$ and $V \setminus A$. The size of
cut $A$ is defined as

\[ c_G(A) = |\{\{u,v\} \in E ; u \in A, v \in V \setminus A\}| \tag{1} \]

and

\[ d_G(S) = \sum_{v \in S} d_G(v) \tag{2} \]

denotes the sum of degrees in cut $S \subseteq V$.

A canonical definition of a graph cluster does not exist, but it is commonly 
agreed that a cluster should be a connected subgraph induced by a vertex set S 
with many internal edges $E(S)$ and few edges to outside vertices in $V \setminus S$ [?]. In
this paper we consider several locally computable fitness functions that are used 
for measuring the quality of a cluster within the graph. The prominent position 
among graph cluster measures is occupied by the conductance [?]
which is defined for any cut $\emptyset \neq S \subset V$ in graph $G$ as follows

\[ \Phi_G(S) = \frac{c_G(S)}{\min(d_G(S), d_G(V \setminus S))} . \tag{3} \]

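Furthermore, the local density $\delta_G(S)$ [22] (cf. the average degree [?]) of a
subset $\emptyset \neq S \subseteq V$ in graph $G$ is the ratio of the number of
edges in subgraph $G(S)$ induced by $S$ over the number of edges in a clique of
size $|S|$ vertices, that is

\[ \delta_G(S) = \frac{|E(S)|}{\binom{|S|}{2}} = \frac{2 \cdot |E(S)|}{|S| \cdot (|S| - 1)} \tag{4} \]

for $S$ containing at least two vertices, whereas we define $\delta_G(S) = 0$ for
$|S| = 1$. Similarly, we define the relative density [?] of cut
$\emptyset \neq S \subset V$ as follows

\[ \rho_G(S) = \frac{|E(S)|}{|E(S)| + c_G(S)} . \tag{5} \]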

Yet another graph cluster measure which we call single cluster editing (cf. [20])
of a subset $S \subseteq V$ counts the number of edge operations (both additions and
deletions) needed to transform $S$ into an isolated clique:

\[ \varepsilon_G(S) = \binom{|S|}{2} - |E(S)| + c_G(S) . \tag{6} \]
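
The four measures can be evaluated directly on a small example. The following Python sketch is only an illustration: the function names and the edge-list representation are choices made here, not part of the paper. The relative density is taken to be $|E(S)| / (|E(S)| + c_G(S))$ and the local density of a single vertex is taken to be 0; both are assumptions about the intended definitions. The graph is assumed to have no isolated vertices, so that no denominator vanishes.

```python
from fractions import Fraction

def cut_size(edges, s):
    """c_G(S): number of edges with exactly one endpoint in S."""
    return sum(1 for u, v in edges if (u in s) != (v in s))

def degree_sum(edges, s):
    """d_G(S): sum of the degrees of the vertices in S."""
    return sum((u in s) + (v in s) for u, v in edges)

def internal_edges(edges, s):
    """|E(S)|: number of edges with both endpoints in S."""
    return sum(1 for u, v in edges if u in s and v in s)

def conductance(vertices, edges, s):
    """Cut size over the smaller of the two degree sums (S nonempty, proper)."""
    rest = set(vertices) - set(s)
    return Fraction(cut_size(edges, s),
                    min(degree_sum(edges, s), degree_sum(edges, rest)))

def local_density(edges, s):
    """Internal edges over the edge count of a clique on |S| vertices."""
    k = len(s)
    if k == 1:
        return Fraction(0)  # assumed convention for a single vertex
    return Fraction(2 * internal_edges(edges, s), k * (k - 1))

def relative_density(edges, s):
    """Internal edges over internal plus cut edges (assumed definition)."""
    inside = internal_edges(edges, s)
    return Fraction(inside, inside + cut_size(edges, s))

def single_cluster_editing(edges, s):
    """Missing internal edges (additions) plus cut edges (deletions)."""
    k = len(s)
    return k * (k - 1) // 2 - internal_edges(edges, s) + cut_size(edges, s)

# Hypothetical example: a triangle {0, 1, 2} with a pendant path 2 - 3 - 4.
V = range(5)
E = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
S = {0, 1, 2}

print(conductance(V, E, S))          # 1/3
print(local_density(E, S))           # 1
print(relative_density(E, S))        # 3/4
print(single_cluster_editing(E, S))  # 1
```

In this example the triangle is a clique with a single outgoing edge, so it scores well under all four measures: only one edge deletion is needed to isolate it.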
Proposed clustering algorithms [?] usually search for clusters that
are optimal with respect to the above-mentioned fitness measures. Therefore the
underlying optimization problems of finding the clusters that minimize the
conductance or maximize the densities or that need a small single cluster editing
are of special interest. In this paper we will formally prove that the associated
decision problems for the conductance (Section 2), local and relative densities
(Section 3), and single cluster editing (Section 4) are NP-complete. These
complexity results appear to be well-known or at least intuitively credible, but not
properly documented in the literature.

2 Conductance 

Finding a subset of vertices that has the minimum conductance in a given graph
has often been stated to be an NP-complete problem in the literature [?].
However, we could not find an explicit proof anywhere. For example,
the NP-completeness proof due to Papadimitriou [21] for the problem of finding
the minimum normalized cut, which is in fact the conductance of a weighted graph,
does not imply the hardness in the unweighted case. Thus we provide the proof
in this section. The decision version for the conductance problem is formulated
as follows:

Minimum Conductance (CONDUCTANCE) 

Instance: An undirected graph $G = (V, E)$ and a rational number $\varphi$.

Question: Is there a cut $S \subset V$ such that $\Phi_G(S) \le \varphi$?

Theorem 1 Conductance is NP-complete. 

Proof: Clearly, Conductance belongs to NP since a nondeterministic algorithm
can guess a cut $S \subset V$ and verify $\Phi_G(S) \le \varphi$ in polynomial
time. For the NP-hardness proof the following maximum cut problem on cubic graphs
will be reduced to Conductance in polynomial time.

Maximum Cut for Cubic Graphs (Max Cut-3)
Instance: A cubic graph $G = (V, E)$ and positive integer $a$.
Question: Is there a cut $A \subseteq V$ such that $c_G(A) \ge a$?

The Max Cut-3 problem was first stated to be NP-complete in [?], which became
a widely used reference [?], although an explicit proof cannot be found there
and we were unable to reconstruct the argument from the sketch. Nevertheless,
the NP-completeness of Max Cut-3 follows from its APX-completeness presented
in [?]. The following reduction to Conductance is adapted from that
used for the minimum edge expansion problem [?].

Given a Max Cut-3 instance, i.e. a cubic graph $G = (V, E)$ with $n = |V|$
vertices, and positive integer $a$, a corresponding undirected graph
$G' = (V', E')$ for Conductance is composed of two fully connected copies of the
complement of $G$, that is $V' = V_1 \cup V_2$ where $V_i = \{v^i ; v \in V\}$ for
$i = 1, 2$, and $E' = E_1 \cup E_2 \cup E_3$ where
$E_i = \{\{u^i, v^i\} ; u, v \in V, u \neq v, \{u,v\} \notin E\}$ for $i = 1, 2$, and
$E_3 = \{\{u^1, v^2\} ; u, v \in V\}$. In addition, define the required conductance
bound

\[ \varphi = \frac{1}{2n-4} \left( n - \frac{2a}{n} \right) . \tag{7} \]

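The number of vertices in $G'$ is $|V'| = 2n$ and the number of edges is
$|E'| = (2n-4) \cdot n$ since

\[ d_{G'}(v) = 2n - 4 \quad \text{for every } v \in V' \tag{8} \]

because $G$ is a cubic graph: each vertex has $n - 4$ neighbors in the complement
of $G$ within its own copy and $n$ neighbors in the other copy. It follows that
$G'$ can be constructed in polynomial time.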
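
The construction of $G'$ can be sketched in a few lines of Python. This is a minimal illustration rather than part of the proof: the function name `build_reduction_graph`, the encoding of the copy $v^i$ as the pair `(v, i)`, and the choice of $K_{3,3}$ as a cubic test instance are all assumptions made here.

```python
from itertools import combinations

def build_reduction_graph(vertices, edges):
    """Return (V', E'): two copies of the complement of G, fully joined."""
    vertices = list(vertices)
    edge_set = {frozenset(e) for e in edges}
    new_vertices = [(v, i) for i in (1, 2) for v in vertices]
    new_edges = set()
    for i in (1, 2):
        # E_i: pairs of distinct vertices that are NOT adjacent in G
        for u, v in combinations(vertices, 2):
            if frozenset((u, v)) not in edge_set:
                new_edges.add(frozenset(((u, i), (v, i))))
    # E_3: every vertex of copy 1 is joined to every vertex of copy 2
    for u in vertices:
        for v in vertices:
            new_edges.add(frozenset(((u, 1), (v, 2))))
    return new_vertices, new_edges

# Hypothetical test instance: the complete bipartite graph K_{3,3} (cubic, n = 6).
V = range(6)
E = [(u, v) for u in (0, 1, 2) for v in (3, 4, 5)]
n = len(V)

V_prime, E_prime = build_reduction_graph(V, E)
degrees = {x: sum(1 for e in E_prime if x in e) for x in V_prime}

print(len(V_prime) == 2 * n)                          # |V'| = 2n
print(len(E_prime) == (2 * n - 4) * n)                # |E'| = (2n - 4) n
print(all(d == 2 * n - 4 for d in degrees.values()))  # every degree is 2n - 4
```

The construction visits each pair of vertices a constant number of times, which is in line with the polynomial-time claim.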

For a cut $\emptyset \neq S \subset V'$ in $G'$ with $k = |S| < 2n$ vertices denote by

\[ S_i = \{v \in V ; v^i \in S\} \quad \text{for } i = 1, 2 \tag{9} \]

the cuts in $G$ that are projections of $S$ to $V_1$ and $V_2$, respectively. Since
$c_{G'}(S) = c_{G'}(V' \setminus S)$ it holds $\Phi_{G'}(S) = \Phi_{G'}(V' \setminus S)$
according to definition (3). Hence, $k \le n$ can be assumed without loss of
generality when computing the conductance in $G'$. Thus,

\[ \Phi_{G'}(S) = \frac{|S| \cdot (2n - |S|) - c_G(S_1) - c_G(S_2)}{(2n-4) \cdot |S|} \tag{10} \]

follows from condition (8) and the fact that $G'$ is composed of two fully connected
complements of $G$, which can be rewritten as

\[ \Phi_{G'}(S) = \frac{1}{2n-4} \left( 2n - k - \frac{c_G(S_1) + c_G(S_2)}{k} \right) . \tag{11} \]
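
The closed form for the conductance in $G'$ can be checked numerically against the definition of conductance. The sketch below is a sanity check on one hypothetical instance ($K_{3,3}$, chosen here as a small cubic graph), not a proof; the helper names are illustrative. It compares the cut size divided by the smaller degree sum with the expression $\frac{1}{2n-4}\left(2n - k - \frac{c_G(S_1) + c_G(S_2)}{k}\right)$ for every cut $S$ of $G'$ with $1 \le |S| \le n$.

```python
from fractions import Fraction
from itertools import combinations

def cut_size(edges, s):
    """Number of edges with exactly one endpoint in s."""
    return sum(1 for u, v in edges if (u in s) != (v in s))

def degree_sum(edges, s):
    """Sum of the degrees of the vertices in s."""
    return sum((u in s) + (v in s) for u, v in edges)

# Hypothetical cubic instance: K_{3,3} with n = 6.
n = 6
V = range(n)
E = [(u, v) for u in (0, 1, 2) for v in (3, 4, 5)]
non_edges = [p for p in combinations(V, 2) if p not in E]

# G': the copy v^i is encoded as the pair (v, i).
Vp = [(v, i) for i in (1, 2) for v in V]
Ep = ([((u, i), (v, i)) for i in (1, 2) for u, v in non_edges]
      + [((u, 1), (v, 2)) for u in V for v in V])

def conductance_direct(s):
    """Conductance of s in G' straight from the definition."""
    rest = set(Vp) - s
    return Fraction(cut_size(Ep, s),
                    min(degree_sum(Ep, s), degree_sum(Ep, rest)))

def conductance_formula(s):
    """Closed form in terms of the projections S_1 and S_2 of s."""
    k = len(s)
    s1 = {v for v, i in s if i == 1}
    s2 = {v for v, i in s if i == 2}
    cuts = cut_size(E, s1) + cut_size(E, s2)
    return Fraction(1, 2 * n - 4) * (2 * n - k - Fraction(cuts, k))

checked = 0
for k in range(1, n + 1):
    for subset in combinations(Vp, k):
        s = set(subset)
        assert conductance_direct(s) == conductance_formula(s)
        checked += 1

print(checked, "cuts checked")
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the comparison is not affected by floating-point rounding.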

Now we verify the correctness of the reduction by proving that the Max
Cut-3 instance has a solution if and only if the corresponding Conductance
instance is solvable. First assume that a cut $A \subseteq V$ exists in $G$ whose
size satisfies

\[ c_G(A) \ge a . \tag{12} \]

Denote by

\[ S^A = \{v^1 \in V_1 ; v \in A\} \cup \{v^2 \in V_2 ; v \in V \setminus A\} \subseteq V' \tag{13} \]

the cut in $G'$ whose projections (9) to $V_1$ and $V_2$ are $S^A_1 = A$ and
$S^A_2 = V \setminus A$, respectively. Since $|S^A| = n$ and
$c_G(A) = c_G(V \setminus A)$ the conductance of $S^A$ can be upper bounded as

\[ \Phi_{G'}(S^A) = \frac{1}{2n-4} \left( n - \frac{2 c_G(A)}{n} \right) \le \varphi \tag{14} \]

according to equations (11), (12), and (7), which shows that $S^A$ is a solution of
the Conductance instance.

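For the converse, assume that the conductance of cut $\emptyset \neq S \subset V'$
in $G'$ meets

\[ \Phi_{G'}(S) \le \varphi . \tag{15} \]

Let $A \subseteq V$ be the maximum cut in $G$. For cut $S^A$ defined according to
(13) we prove that

\[ \Phi_{G'}(S^A) \le \Phi_{G'}(S) \tag{16} \]

which is rewritten to

\[ \frac{1}{2n-4} \left( n - \frac{2 c_G(A)}{n} \right) \le \frac{1}{2n-4} \left( 2n - k - \frac{c_G(S_1) + c_G(S_2)}{k} \right) \tag{17} \]

according to (14) and (11) where $k = |S| \le n$ and $S_1$, $S_2$ are defined in (9).
Since $2 c_G(A) \ge c_G(S_1) + c_G(S_2)$ because $A$ is the maximum cut in $G$, it
suffices to show

\[ n - k + \left( \frac{1}{n} - \frac{1}{k} \right) \left( c_G(S_1) + c_G(S_2) \right) \ge 0 \tag{18} \]

which follows from $\frac{1}{n} - \frac{1}{k} \le 0$ and
$c_G(S_1) + c_G(S_2) \le |S_1| \cdot n + |S_2| \cdot n = kn$. Thus,

\[ \frac{1}{2n-4} \left( n - \frac{2 c_G(A)}{n} \right) = \Phi_{G'}(S^A) \le \Phi_{G'}(S) \le \varphi = \frac{1}{2n-4} \left( n - \frac{2a}{n} \right) \tag{19} \]

holds according to (14), (16), (15), and (7), which implies $c_G(A) \ge a$. Hence, $A$
solves the Max Cut-3 instance. □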
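
The equivalence established in the proof can be exercised end to end on a tiny instance by brute force. The following Python sketch is an illustration under stated assumptions: the triangular prism is picked here as a small non-bipartite cubic graph, the helper names are hypothetical, and exhaustive search is of course exponential, so this is only a sanity check and says nothing about efficient algorithms.

```python
from fractions import Fraction
from itertools import combinations

def cut_size(edges, s):
    """Number of edges with exactly one endpoint in s."""
    return sum(1 for u, v in edges if (u in s) != (v in s))

def degree_sum(edges, s):
    """Sum of the degrees of the vertices in s."""
    return sum((u in s) + (v in s) for u, v in edges)

def subsets(items, sizes):
    """All subsets of items whose size lies in sizes."""
    for k in sizes:
        for s in combinations(items, k):
            yield set(s)

# Hypothetical cubic instance: the triangular prism (two triangles plus a matching).
n = 6
V = list(range(n))
E = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (0, 3), (1, 4), (2, 5)]
non_edges = [p for p in combinations(V, 2) if p not in E]

# Reduction graph G': the copy v^i is encoded as the pair (v, i).
Vp = [(v, i) for i in (1, 2) for v in V]
Ep = ([((u, i), (v, i)) for i in (1, 2) for u, v in non_edges]
      + [((u, 1), (v, 2)) for u in V for v in V])

def conductance(s):
    """Conductance of s in G' straight from the definition."""
    rest = set(Vp) - s
    return Fraction(cut_size(Ep, s),
                    min(degree_sum(Ep, s), degree_sum(Ep, rest)))

def phi(a):
    """Conductance bound of the reduction for the Max Cut-3 bound a."""
    return Fraction(1, 2 * n - 4) * (n - Fraction(2 * a, n))

max_cut = max(cut_size(E, a) for a in subsets(V, range(n + 1)))
min_conductance = min(conductance(s) for s in subsets(Vp, range(1, 2 * n)))

# The Max Cut-3 answer and the Conductance answer must agree for every bound a.
for a in range(1, len(E) + 2):
    assert (max_cut >= a) == (min_conductance <= phi(a))

print(max_cut, min_conductance)
```

For this instance the maximum cut has 7 edges, and the minimum conductance of $G'$ equals the bound $\varphi$ evaluated at $a = 7$, as the chain of inequalities in the proof predicts.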



3 Local and Relative Density 

The decision version of the maximum density problem is formulated as follows:

Maximum Density (Density)

Instance: An undirected graph $G = (V, E)$, positive integer $k \le |V|$, and a
rational number $0 \le r \le 1$.

Question: Is there a subset $S \subseteq V$ such that $|S| = k$ and the density of
$S$ in $G$ is at least $r$?
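
As an illustration of the decision problem, a brute-force check can be written as follows. This is a minimal sketch, not an efficient algorithm: it enumerates all $k$-subsets, the names `local_density` and `density_decision` are hypothetical, and the local density of a single vertex is assumed to be 0. With the local density and $r = 1$, the question becomes whether the graph contains a clique on $k \ge 2$ vertices.

```python
from fractions import Fraction
from itertools import combinations

def local_density(edges, s):
    """Internal edges of s over the edge count of a clique on |s| vertices."""
    k = len(s)
    if k == 1:
        return Fraction(0)  # assumed convention for a single vertex
    inside = sum(1 for u, v in edges if u in s and v in s)
    return Fraction(2 * inside, k * (k - 1))

def density_decision(vertices, edges, k, r, density=local_density):
    """Brute force: is there a subset S with |S| = k and density(S) >= r?"""
    return any(density(edges, set(s)) >= r
               for s in combinations(vertices, k))

# Hypothetical example: a triangle {0, 1, 2} with a pendant path 2 - 3 - 4.
V = range(5)
E = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

has_triangle = density_decision(V, E, 3, 1)  # a 3-clique exists
has_k4 = density_decision(V, E, 4, 1)        # no 4-clique exists
print(has_triangle, has_k4)
```

The `density` parameter is left pluggable so that the same enumeration could be used with another density measure.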

We distinguish between Local Density and Relative Density problems
according to the particular density measure used, which is the local density (4) and
the relative density (5), respectively. Clearly, Local Density is NP-complete
since this problem for $r = 1$ coincides with the NP-complete Clique problem [15].
Also the NP-completeness of Relative Density can easily be shown:

Theorem 2 Relative Density is NP-complete. 
Proof: Obviously, Relative Density belongs to NP since a nondeterministic
algorithm can guess a cut $S \subseteq V$ of cardinality $|S| = k$ and verify
$\rho_G(S) \ge r$ in polynomial time. For the NP-hardness proof the following
minimum bisection problem on cubic graphs, which is known to be NP-complete [?],
will be reduced to Relative Density in polynomial time.

Minimum Bisection for Cubic Graphs (Min Bisection-3)

Instance: A cubic graph $G = (V, E)$ with $n = |V|$ vertices and positive integer $a$.

Question: Is there a cut $S \subseteq V$ such that $|S| = \frac{n}{2}$ and $c_G(S) \le a$?

Given a Min Bisection-3 instance, i.e. a cubic graph $G = (V, E)$ with $n = |V|$
vertices, and positive integer $a$, a corresponding Relative Density instance
consists of the same graph $G$, parameters $k = \frac{n}{2}$ and

\[ r = \frac{3n - 2a}{3n + 2a} . \tag{20} \]

Now for any subset $S \subseteq V$ such that $|S| = k = \frac{n}{2}$ it holds

\[ |E(S)| = \frac{3|S| - c_G(S)}{2} = \frac{3n - 2 c_G(S)}{4} \tag{21} \]

because $G$ is a cubic graph, which gives

\[ \rho_G(S) = \frac{3n - 2 c_G(S)}{3n + 2 c_G(S)} \tag{22} \]

according to (5). It follows from (20) and (22) that $\rho_G(S) \ge r$ iff
$c_G(S) \le a$. □
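
The identities used in this proof can be checked on a small cubic graph. The sketch below is a sanity check on one hypothetical instance (the triangular prism, chosen here) with illustrative helper names; it takes the relative density to be $|E(S)| / (|E(S)| + c_G(S))$, which is an assumption about the intended definition. For every bisection it verifies that $4|E(S)| = 3n - 2c_G(S)$, that the relative density equals $\frac{3n - 2c_G(S)}{3n + 2c_G(S)}$, and that the density threshold is met exactly when the cut bound is met.

```python
from fractions import Fraction
from itertools import combinations

def cut_size(edges, s):
    """Number of edges with exactly one endpoint in s."""
    return sum(1 for u, v in edges if (u in s) != (v in s))

def internal_edges(edges, s):
    """Number of edges with both endpoints in s."""
    return sum(1 for u, v in edges if u in s and v in s)

def relative_density(edges, s):
    """Internal edges over internal plus cut edges (assumed definition)."""
    inside = internal_edges(edges, s)
    return Fraction(inside, inside + cut_size(edges, s))

# Hypothetical cubic instance: the triangular prism, n = 6.
n = 6
V = range(n)
E = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (0, 3), (1, 4), (2, 5)]

bisections = [set(s) for s in combinations(V, n // 2)]
min_bisection = min(cut_size(E, s) for s in bisections)

for a in range(1, len(E) + 1):
    r = Fraction(3 * n - 2 * a, 3 * n + 2 * a)  # density threshold for bound a
    for s in bisections:
        c = cut_size(E, s)
        # edge count inside a half of a cubic graph
        assert 4 * internal_edges(E, s) == 3 * n - 2 * c
        # closed form of the relative density of a bisection
        assert relative_density(E, s) == Fraction(3 * n - 2 * c, 3 * n + 2 * c)
        # the two decision questions agree on every bisection
        assert (relative_density(E, s) >= r) == (c <= a)

print(len(bisections), min_bisection)
```

The equivalence holds because the closed form is strictly decreasing in the cut size, so a denser half always corresponds to a smaller bisection.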



4 Single Cluster Editing 

The problem of deciding whether a given graph can be transformed into a collection
of cliques using at most $m$ edge operations (both additions and deletions),
which is called Cluster Editing, is known to be NP-complete [20]. When the
desired solution must contain exactly $p$ cliques, the so-called $p$-Cluster Editing
problem remains NP-complete for every $p \ge 2$. Here we study the issue of
whether a given graph contains a subset $S$ of exactly $k$ vertices such that at most
$m$ edge additions and deletions suffice altogether to turn $S$ into an isolated clique:

Minimum Single Cluster Editing (1-Cluster Editing)

Instance: An undirected graph $G = (V, E)$, positive integers $k \le |V|$ and $m$.

Question: Is there a subset $S \subseteq V$ such that $|S| = k$ and $\varepsilon_G(S) \le m$?

Theorem 3 1-Cluster Editing is NP-complete. 
Proof: Obviously, 1-Cluster Editing belongs to NP since a nondeterministic
algorithm can guess a subset $S \subseteq V$ of cardinality $|S| = k$ and verify
$\varepsilon_G(S) \le m$ in polynomial time. For the NP-hardness proof the
Min Bisection-3 problem is used again (cf. the proof of Theorem 2), which will be
reduced to 1-Cluster Editing in polynomial time.

Given a Min Bisection-3 instance, i.e. a cubic graph $G = (V, E)$ with
$n = |V|$ vertices, and positive integer $a$, a corresponding 1-Cluster Editing
instance consists of the same graph $G$, parameters $k = \frac{n}{2}$ and

\[ m = \frac{12a + n(n-8)}{8} . \tag{23} \]

Now for any subset $S \subseteq V$ such that $|S| = k = \frac{n}{2}$ it holds

\[ \varepsilon_G(S) = \frac{|S| \cdot (|S| - 1)}{2} - \frac{3|S| - c_G(S)}{2} + c_G(S) = \frac{12 c_G(S) + n(n-8)}{8} \tag{24} \]

according to (6) and (21). It follows from (23) and (24) that $\varepsilon_G(S) \le m$
iff $c_G(S) \le a$. □
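
As with the density reduction, the closed form for the editing cost of a bisection can be checked by enumeration. The following sketch uses $K_{3,3}$ as a hypothetical cubic instance and illustrative helper names; it confirms on that instance that the editing cost of every bisection equals $\frac{12 c_G(S) + n(n-8)}{8}$ and that the editing bound is met exactly when the cut bound is met.

```python
from fractions import Fraction
from itertools import combinations

def cut_size(edges, s):
    """Number of edges with exactly one endpoint in s."""
    return sum(1 for u, v in edges if (u in s) != (v in s))

def single_cluster_editing(edges, s):
    """Missing internal edges (additions) plus cut edges (deletions)."""
    k = len(s)
    inside = sum(1 for u, v in edges if u in s and v in s)
    return k * (k - 1) // 2 - inside + cut_size(edges, s)

# Hypothetical cubic instance: K_{3,3} with n = 6.
n = 6
V = range(n)
E = [(u, v) for u in (0, 1, 2) for v in (3, 4, 5)]

bisections = [set(s) for s in combinations(V, n // 2)]
min_editing = min(single_cluster_editing(E, s) for s in bisections)

for a in range(1, len(E) + 1):
    m = Fraction(12 * a + n * (n - 8), 8)  # editing bound for cut bound a
    for s in bisections:
        c = cut_size(E, s)
        cost = single_cluster_editing(E, s)
        # closed form of the editing cost of a bisection
        assert cost == Fraction(12 * c + n * (n - 8), 8)
        # the two decision questions agree on every bisection
        assert (cost <= m) == (c <= a)

print(min_editing)
```

The bound is kept as an exact fraction here because the expression need not be an integer for every choice of $n$ and $a$; comparing against it is still well defined.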



5 Conclusion 



In this paper we have presented the explicit NP-completeness proofs for the
decision problems associated with the optimization of four possible graph cluster
measures, namely the conductance, the local and relative densities, and single
cluster editing. In clustering algorithms, combinations of fitness measures are
often preferred as only optimizing one may result in anomalies such as selecting
small cliques or connected components as clusters. An open problem is the
complexity of minimizing the product of the local and relative densities [?] (e.g.
their sum is closely related to the edge operation count for the single cluster
editing problem). Another important area for further research is the complexity
of finding related approximation solutions [2].